Conventional domain generalization aims to learn domain-invariant representations from multiple domains, which requires accurate annotations. In realistic application scenarios, however, collecting and annotating such a large amount of data is too cumbersome or even infeasible. Web data, on the other hand, offers a free lunch: a huge volume of unlabeled data with rich style information that can be exploited to augment domain generalization ability. In this paper, we introduce a novel task, termed semi-supervised domain generalization, to study how to interact labeled and unlabeled domains, and we establish two benchmarks, including one built on web-crawled data, which poses a novel yet realistic challenge that pushes the limits of existing techniques. To tackle this task, a straightforward solution is to propagate class information from the labeled to the unlabeled domains via pseudo labeling in conjunction with domain confusion training. Considering that narrowing the domain gap can improve the quality of pseudo labels and further advance domain-invariant feature learning for generalization, we propose a cycle learning framework to encourage positive feedback between label propagation and domain generalization, in favor of an evolving intermediate domain that bridges the labeled and unlabeled domains in a curriculum learning manner. Experiments are conducted to validate the effectiveness of our framework. It is worth highlighting that web-crawled data benefits domain generalization, as demonstrated by our results. Our code will be released later.
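Since the pseudo-labeling baseline above is only sketched in words, here is a hedged PyTorch illustration of propagating class information to an unlabeled domain with a confidence threshold; the `conf_threshold` value and the training-step structure are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(model, optimizer, x_labeled, y_labeled, x_unlabeled,
                      conf_threshold=0.95):
    """One illustrative training step: supervised loss on the labeled domain
    plus cross-entropy on confidently pseudo-labeled unlabeled (web) samples."""
    model.train()
    logits_l = model(x_labeled)
    loss = F.cross_entropy(logits_l, y_labeled)

    with torch.no_grad():
        probs_u = F.softmax(model(x_unlabeled), dim=1)
        conf, pseudo_y = probs_u.max(dim=1)
        mask = conf >= conf_threshold            # keep only confident predictions

    if mask.any():
        logits_u = model(x_unlabeled[mask])
        loss = loss + F.cross_entropy(logits_u, pseudo_y[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```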
Ultra-high resolution image segmentation has attracted increasing interest in recent years due to its realistic applications. In this paper, we innovate the widely used high-resolution image segmentation pipeline, in which an ultra-high resolution image is partitioned into regular patches for local segmentation and then the local results are merged into a high-resolution semantic mask. In particular, we introduce a novel locality-aware context fusion based segmentation model to process local patches, where the relevance between a local patch and its various contexts is jointly and complementarily utilized to handle semantic regions with large variations. Additionally, we present an alternating local enhancement module that restricts the negative impact of redundant information introduced from the contexts, and is thus endowed with the ability to fix the locality-aware features and produce refined results. Furthermore, in comprehensive experiments, we demonstrate that our model outperforms other state-of-the-art methods on public benchmarks. Our released codes are available at: https://github.com/liqiokkk/FCtL.
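The partition-and-merge pipeline described above can be pictured with a minimal sketch, assuming a non-overlapping sliding window and a black-box `segment_patch` model standing in for the locality-aware context fusion network; the patch size and class count are illustrative assumptions.

```python
import torch

def segment_ultra_high_res(image, segment_patch, patch=512, num_classes=19):
    """Partition an ultra-high resolution image (C, H, W) into regular patches,
    run a local segmentation model on each, and merge the logits back into a
    full-resolution semantic mask. Plain non-overlapping sliding-window sketch."""
    _, H, W = image.shape
    logits = torch.zeros(num_classes, H, W)
    with torch.no_grad():
        for top in range(0, H, patch):
            for left in range(0, W, patch):
                crop = image[:, top:top + patch, left:left + patch]
                out = segment_patch(crop.unsqueeze(0))[0]   # (num_classes, h, w)
                logits[:, top:top + crop.shape[1],
                       left:left + crop.shape[2]] = out
    return logits.argmax(dim=0)                             # (H, W) label mask
```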
Crowd counting plays an important role in risk perception and early warning, traffic control, and scene statistical analysis. The challenges of crowd counting in highly dense and complex scenes lie in the mutual occlusion of human body parts, the large variation of body scales, and the complexity of imaging conditions. Deep learning based head detection is a promising method for crowd counting. However, the widely studied object detection networks cannot be applied well to this field, for two main reasons. First, most of the existing head detection datasets are annotated only with center points instead of the bounding boxes that are mandatory for canonical detectors. Second, the sample imbalance has not yet been overcome in highly dense and complex scenes, because the existing loss functions calculate the positive loss at a single key point or over the entire target area with the same weight. To address these problems, we propose a novel loss function, called Mask Focal Loss, to unify the loss functions based on heatmap ground truth (GT) and binary feature map GT. Mask Focal Loss redefines the weight of the loss contributions according to the in-situ value of the Gaussian-kernel heatmap. For better evaluation and comparison, a new synthetic dataset, GTA\_Head, is made public, including 35 sequences, 5,096 images, and 1,732,043 head labels with bounding boxes. Experimental results show superior performance and demonstrate that our proposed Mask Focal Loss is applicable to all canonical detectors and to various datasets with different GT. This provides a strong basis for surpassing crowd counting methods based on density estimation.
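The abstract does not spell out the loss formula, so the snippet below is only one hedged reading of the idea: a Gaussian-penalty-reduced focal loss in the spirit of CornerNet/CenterNet, whose weighting follows the per-pixel heatmap value and which degenerates to the binary feature map case when the GT contains only 0/1 values. The `alpha`/`beta` exponents are assumptions, not necessarily the paper's exact definition.

```python
import torch

def mask_focal_loss(pred, heatmap_gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Hedged sketch of a heatmap-weighted focal loss.

    pred:       predicted heatmap in (0, 1), shape (B, C, H, W)
    heatmap_gt: Gaussian-kernel ground truth in [0, 1]; a binary feature map GT
                is simply the special case with values in {0, 1}.
    """
    pos = heatmap_gt.eq(1).float()
    neg = 1.0 - pos
    neg_weight = (1.0 - heatmap_gt).pow(beta)   # down-weight pixels near object centers

    pos_loss = -((1 - pred).pow(alpha) * torch.log(pred + eps)) * pos
    neg_loss = -(pred.pow(alpha) * torch.log(1 - pred + eps)) * neg_weight * neg

    num_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos
```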
Semantic segmentation is a high-level computer vision task that assigns a label to each pixel of an image. It is challenging to deal with extremely imbalanced data in which the ratio of target pixels to background pixels is lower than 1:1000. Such severe input imbalance leads to output imbalance and thus to poor model training. This paper considers three issues for extremely imbalanced data: inspired by region-based losses, an implicit measure of the output imbalance is proposed, and an adaptive algorithm is designed to guide the selection of the output-imbalance hyperparameter; the measure is then generalized to distribution-based losses for dealing with output imbalance; and finally, a compound loss with our adaptive hyperparameter selection algorithm keeps training and inference consistent so as to harmonize the output imbalance. With four popular deep architectures on our private dataset with three input imbalance scales and on three public datasets, extensive experiments demonstrate the competitive and promising performance of the proposed method.
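To make the compound-loss idea concrete, here is a hedged sketch combining a distribution-based term (binary cross-entropy) with a region-based term (soft Dice); the fixed weight `lam` stands in for the paper's adaptively selected output-imbalance hyperparameter.

```python
import torch
import torch.nn.functional as F

def compound_loss(logits, target, lam=0.5, eps=1e-6):
    """Hedged sketch for binary segmentation: distribution-based CE plus
    region-based soft Dice. logits: (B, 1, H, W) raw scores,
    target: (B, 1, H, W) with values in {0, 1}."""
    ce = F.binary_cross_entropy_with_logits(logits, target.float())

    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)

    return lam * ce + (1.0 - lam) * dice.mean()
```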
Learning an explainable classifier often results in a low-accuracy model or ends up with a huge rule set, while learning a deep model is usually more capable of handling noisy data at scale, but at the cost of results that are hard to explain and weak generalization. To mitigate this gap, we propose an end-to-end deep explainable learning approach that combines the advantages of deep models in noise handling with expert rule-based interpretability. Specifically, we propose to learn a deep data assessing model that represents the data as a graph capturing the correlations among different observations, whose output is used to extract key data features. The key features are then fed into a rule network constructed from predefined noisy expert rules with trainable parameters. As these models are correlated, we propose an end-to-end training framework, utilizing the rule classification loss to optimize the rule learning model and the data assessing model at the same time. As the rule-based computation is non-differentiable, we propose a gradient linking search module to carry the gradient information from the rule learning model to the data assessing model. The proposed method is tested in an industrial production system, showing comparable prediction accuracy, much higher generalization stability, and better interpretability when compared with a strong deep ensemble baseline, and it shows much better fitting power than the pure rule-based approach.
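The rule network is described only at a high level, so the following is merely a hedged sketch of one way predefined threshold-style expert rules with trainable parameters could be relaxed into a differentiable module; the class name, sigmoid relaxation, and temperature are assumptions for illustration, not the paper's construction.

```python
import torch
import torch.nn as nn

class SoftRuleNetwork(nn.Module):
    """Hedged sketch: differentiable relaxation of threshold-style expert rules.
    Each rule fires softly when its key feature exceeds a trainable threshold;
    a learnable weighting aggregates rule activations into a prediction."""
    def __init__(self, num_rules, temperature=10.0):
        super().__init__()
        self.thresholds = nn.Parameter(torch.zeros(num_rules))   # initialized from expert rules in practice
        self.rule_weights = nn.Parameter(torch.ones(num_rules) / num_rules)
        self.temperature = temperature

    def forward(self, key_features):
        # key_features: (B, num_rules), one extracted key feature per rule
        activations = torch.sigmoid(self.temperature * (key_features - self.thresholds))
        return activations @ self.rule_weights                   # (B,) soft rule score
```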
Quantum autoencoders, which aim at compressing quantum information into a low-dimensional latent space, lie at the heart of automatic data compression in the field of quantum information. In this paper, we establish an upper bound on the compression rate for a given quantum autoencoder and propose a learning control approach for training the autoencoder to achieve the maximal compression rate. The upper bound on the compression rate is theoretically proven using eigen-decomposition and matrix differentiation, and it is determined by the eigenvalues of the density matrix representation of the input states. Numerical results on 2-qubit and 3-qubit systems are presented to demonstrate how to train the quantum autoencoder to achieve the theoretically maximal compression, and the training performance obtained with different machine learning algorithms is compared. Experimental results of a quantum autoencoder implemented with quantum optical systems are illustrated for compressing two 2-qubit states into two 1-qubit states.
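Since the bound is stated to depend on the eigenvalues of the density matrix of the input states, a small NumPy sketch of the quantities involved may help; summing the largest eigenvalues that fit into the latent space is one natural reading, not necessarily the paper's exact bound, and the example states are purely illustrative.

```python
import numpy as np

def average_density_matrix(states, probs):
    """Density matrix rho = sum_i p_i |psi_i><psi_i| of the input ensemble."""
    dim = states[0].shape[0]
    rho = np.zeros((dim, dim), dtype=complex)
    for p, psi in zip(probs, states):
        rho += p * np.outer(psi, psi.conj())
    return rho

def eigenvalue_compression_bound(rho, latent_dim):
    """One natural reading of an eigenvalue-based compression bound:
    the total weight of the `latent_dim` largest eigenvalues of rho."""
    eigvals = np.linalg.eigvalsh(rho)                # real for Hermitian rho
    return float(np.sort(eigvals)[::-1][:latent_dim].sum())

# Example: an ensemble of two 2-qubit pure states, latent space of one qubit (dim 2)
psi1 = np.array([1, 0, 0, 0], dtype=complex)
psi2 = np.array([0, 1, 1, 0], dtype=complex) / np.sqrt(2)
rho = average_density_matrix([psi1, psi2], [0.5, 0.5])
print(eigenvalue_compression_bound(rho, latent_dim=2))
```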
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
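As a hedged illustration of what image-level photometric alignment can look like in its simplest form, the snippet below matches per-channel statistics of a target-domain image to a source-domain reference; the paper's global photometric alignment module is learned and more elaborate, so this only conveys the image-level idea.

```python
import torch

def global_photometric_align(target_img, source_ref, eps=1e-6):
    """Align per-channel mean/std of a target-domain image (C, H, W) to a
    source-domain reference image. Simplest possible image-level alignment,
    shown only to illustrate the idea of reducing low-level statistics gaps."""
    t_mean = target_img.mean(dim=(1, 2), keepdim=True)
    t_std = target_img.std(dim=(1, 2), keepdim=True)
    s_mean = source_ref.mean(dim=(1, 2), keepdim=True)
    s_std = source_ref.std(dim=(1, 2), keepdim=True)
    return (target_img - t_mean) / (t_std + eps) * s_std + s_mean
```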
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
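One simple way a style code can "adjust the weights" of the feed-forward layers is a learned scale-and-shift modulation, sketched below under that assumption; the layer sizes, names, and modulation scheme are illustrative and may differ from the paper's style-aware adaptive transformer.

```python
import torch
import torch.nn as nn

class StyleAdaptiveFFN(nn.Module):
    """Hedged sketch: a feed-forward block whose hidden activations are scaled
    and shifted by a style code, one plausible form of style-aware adaptation."""
    def __init__(self, d_model=256, d_hidden=1024, d_style=128):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        self.to_scale_shift = nn.Linear(d_style, 2 * d_hidden)

    def forward(self, x, style_code):
        # x: (B, T, d_model), style_code: (B, d_style)
        scale, shift = self.to_scale_shift(style_code).chunk(2, dim=-1)
        h = torch.relu(self.fc1(x))
        h = h * (1.0 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return self.fc2(h)
```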
Witnessing the impressive achievements of pre-training techniques on large-scale data in the fields of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem in visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, rendering predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns the driving policy representation by predicting the future ego-motion and optimizing the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
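The photometric error used in the second stage is not detailed in the abstract; the sketch below shows the L1 + SSIM photometric reconstruction error that is standard in self-supervised depth and ego-motion learning, under the assumption that PPGeo uses a similar formulation (the weighting `alpha` is illustrative).

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over 3x3 neighborhoods, as commonly used in
    self-supervised depth estimation."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_error(reconstructed, target, alpha=0.85):
    """Hedged sketch: photometric error between a frame re-synthesized from the
    predicted pose/depth and the observed frame, mixing SSIM and L1 terms."""
    l1 = (reconstructed - target).abs().mean(dim=1, keepdim=True)
    ssim_term = (1.0 - ssim(reconstructed, target)).mean(dim=1, keepdim=True) / 2.0
    return (alpha * ssim_term + (1.0 - alpha) * l1).mean()
```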
Increasing research interest focuses on sequential recommender systems, which aim to model dynamic sequence representations precisely. However, the loss functions most commonly used in state-of-the-art sequential recommendation models have essential limitations. To name a few: Bayesian Personalized Ranking (BPR) loss suffers from the vanishing gradient problem caused by numerous negative samples and prediction biases; Binary Cross-Entropy (BCE) loss is sensitive to the number of negative samples, and is therefore likely to ignore valuable negative examples and reduce training efficiency; Cross-Entropy (CE) loss only focuses on the last timestamp of the training sequence, which causes low utilization of sequence information and results in inferior user sequence representations. To avoid these limitations, in this paper we propose to calculate a Cumulative Cross-Entropy (CCE) loss over the sequence. CCE is simple and direct, and enjoys the virtues of painless deployment, no negative sampling, and effective and efficient training. We conduct extensive experiments on five benchmark datasets to demonstrate the effectiveness and efficiency of CCE. The results show that employing CCE loss on three state-of-the-art models, GRU4Rec, SASRec, and S3-Rec, can reach 125.63%, 69.90%, and 33.24% average improvement in full-ranking NDCG@5, respectively. Using CCE, the performance curve of the models on the test data increases rapidly with wall-clock time and is superior to that of other loss functions in almost the whole process of model training.
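A hedged sketch of how a cumulative cross-entropy over the whole sequence could be computed, as opposed to CE at the last timestamp only; the tensor shapes and the `pad_id` convention are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cumulative_cross_entropy(logits, targets, pad_id=0):
    """Hedged sketch of Cumulative Cross-Entropy: full-softmax cross-entropy
    accumulated over every timestamp of the training sequence rather than the
    last one only, with padded positions ignored.

    logits:  (B, T, num_items) scores over the full item set at each step
    targets: (B, T) ground-truth next-item ids, `pad_id` where padded
    """
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
        reduction="sum",
    )
    num_valid = (targets != pad_id).sum().clamp(min=1)
    return loss / num_valid
```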